This loan data is from a lending company called Prosper. It was last updated 3/11/2014. It has 113937 records with 81 features, namely
ListingKey,ListingNumber,ListingCreationDate,CreditGrade,Term,LoanStatus,ClosedDate,BorrowerAPR,BorrowerRate,LenderYield,EstimatedEffectiveYield,EstimatedLoss,EstimatedReturn,ProsperRating (numeric),ProsperRating (Alpha),ProsperScore,ListingCategory (numeric),BorrowerState,Occupation,EmploymentStatus,EmploymentStatusDuration,IsBorrowerHomeowner,CurrentlyInGroup,GroupKey,DateCreditPulled,CreditScoreRangeLower,CreditScoreRangeUpper,FirstRecordedCreditLine,CurrentCreditLines,OpenCreditLines,TotalCreditLinespast7years,OpenRevolvingAccounts,OpenRevolvingMonthlyPayment,InquiriesLast6Months,TotalInquiries,CurrentDelinquencies,AmountDelinquent,DelinquenciesLast7Years,PublicRecordsLast10Years,PublicRecordsLast12Months,RevolvingCreditBalance,BankcardUtilization,AvailableBankcardCredit,TotalTrades,TradesNeverDelinquent (percentage),TradesOpenedLast6Months,DebtToIncomeRatio,IncomeRange,IncomeVerifiable,StatedMonthlyIncome,LoanKey,TotalProsperLoans,TotalProsperPaymentsBilled,OnTimeProsperPayments,ProsperPaymentsLessThanOneMonthLate,ProsperPaymentsOneMonthPlusLate,ProsperPrincipalBorrowed,ProsperPrincipalOutstanding,ScorexChangeAtTimeOfListing,LoanCurrentDaysDelinquent,LoanFirstDefaultedCycleNumber,LoanMonthsSinceOrigination,LoanNumber,LoanOriginalAmount,LoanOriginationDate,LoanOriginationQuarter,MemberKey,MonthlyLoanPayment,LP_CustomerPayments,LP_CustomerPrincipalPayments,LP_InterestandFees,LP_ServiceFees,LP_CollectionFees,LP_GrossPrincipalLoss,LP_NetPrincipalLoss,LP_NonPrincipalRecoverypayments,PercentFunded,Recommendations,InvestmentFromFriendsCount,InvestmentFromFriendsAmount,Investors.
Please click on the following link to see the definitions of the dataset's features: Prosper Data Dictionary to Explain Dataset's Variables.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.
# pld will be short for prosperLoanData
pld = pd.read_csv('prosperLoanData.csv')
# show all of the columns
pd.options.display.max_columns = None
pld.head()
pld.info()
pld.shape
pld.describe()
#features_corr = ['Term', 'LoanStatus', 'BorrowerRate', 'LenderYield', 'ProsperRating (numeric)', 'ProsperScore', 'ListingCategory (numeric)', 'BorrowerState', 'EmploymentStatus', 'EmploymentStatusDuration', 'IsBorrowerHomeowner', 'CreditScoreRangeUpper', 'FirstRecordedCreditLine', 'CurrentCreditLines', 'OpenCreditLines', 'TotalCreditLinespast7years', 'OpenRevolvingAccounts', 'OpenRevolvingMonthlyPayment', 'InquiriesLast6Months', 'TotalInquiries', 'CurrentDelinquencies', 'AmountDelinquent', 'DelinquenciesLast7Years', 'PublicRecordsLast12Months', 'RevolvingCreditBalance', 'AvailableBankcardCredit', 'DebtToIncomeRatio', 'IncomeRange', 'IncomeVerifiable', 'StatedMonthlyIncome', 'LoanOriginalAmount', 'MonthlyLoanPayment']
features_corr = ['Term', 'LoanStatus', 'BorrowerRate', 'ProsperRating (numeric)', 'ProsperScore', 'ListingCategory (numeric)', 'BorrowerState', 'EmploymentStatus', 'EmploymentStatusDuration', 'IsBorrowerHomeowner', 'CreditScoreRangeUpper', 'FirstRecordedCreditLine', 'CurrentCreditLines', 'InquiriesLast6Months', 'CurrentDelinquencies', 'AmountDelinquent', 'DelinquenciesLast7Years', 'PublicRecordsLast12Months', 'RevolvingCreditBalance', 'AvailableBankcardCredit', 'DebtToIncomeRatio', 'IncomeRange', 'StatedMonthlyIncome', 'LoanOriginalAmount']
temp2_df = pld[features_corr]
corr_big = temp2_df.corr()
corr_big
pld.EmploymentStatus.value_counts()
print(list(pld.columns))
pld.ListingKey.nunique()
# there are 113937 records and 113066 unique listing keys. I will verify quickly if these are duplicates.
key_counts = pld.ListingKey.value_counts()
key_counts
key_counts.value_counts()
dup_idx = pld[pld.ListingKey.duplicated()]['ListingKey']
print(type(dup_idx))
print(dup_idx[:10])
tt = dup_idx.tolist()
tt
len(tt)
dup_listing_idx = set(dup_idx.tolist())
len(dup_listing_idx)
According to Prosper's variable definitions, ListingKey is supposed to be UNIQUE. dup_listing_idx holds the listing keys that appear more than once. I will remove their duplicate rows, keeping the first occurrence of each, and then assign -1 to ProsperScore on those rows to mark them as needing correction.
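The plan above can be sketched on a toy frame; the keys 'K1'/'K2' and scores below are made up, standing in for the real ListingKey values:

```python
import pandas as pd

# toy stand-in for the duplicated listings
toy = pd.DataFrame({
    'ListingKey': ['K1', 'K1', 'K2'],
    'ProsperScore': [4.0, 8.0, 6.0],
})

# keys that occur more than once
dup_keys = toy.loc[toy.ListingKey.duplicated(), 'ListingKey'].unique()

# keep the first row per key, then flag formerly duplicated keys with a
# sentinel ProsperScore of -1 so they can be corrected later
deduped = toy.drop_duplicates(subset='ListingKey', keep='first').copy()
deduped.loc[deduped.ListingKey.isin(dup_keys), 'ProsperScore'] = -1
```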
sub_cols = ['ProsperScore', 'ListingKey', 'ListingNumber', 'ListingCreationDate', 'CreditGrade', 'Term', 'LoanStatus', 'ClosedDate', 'BorrowerAPR', 'BorrowerRate', 'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss', 'EstimatedReturn', 'ProsperRating (numeric)', 'ProsperRating (Alpha)', 'ListingCategory (numeric)', 'BorrowerState', 'Occupation', 'EmploymentStatus', 'EmploymentStatusDuration', 'IsBorrowerHomeowner', 'CurrentlyInGroup', 'GroupKey', 'DateCreditPulled', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'FirstRecordedCreditLine', 'CurrentCreditLines', 'OpenCreditLines', 'TotalCreditLinespast7years', 'OpenRevolvingAccounts', 'OpenRevolvingMonthlyPayment', 'InquiriesLast6Months', 'TotalInquiries', 'CurrentDelinquencies', 'AmountDelinquent', 'DelinquenciesLast7Years', 'PublicRecordsLast10Years', 'PublicRecordsLast12Months', 'RevolvingCreditBalance', 'BankcardUtilization', 'AvailableBankcardCredit', 'TotalTrades', 'TradesNeverDelinquent (percentage)', 'TradesOpenedLast6Months', 'DebtToIncomeRatio', 'IncomeRange', 'IncomeVerifiable', 'StatedMonthlyIncome', 'LoanKey', 'TotalProsperLoans', 'TotalProsperPaymentsBilled', 'OnTimeProsperPayments', 'ProsperPaymentsLessThanOneMonthLate', 'ProsperPaymentsOneMonthPlusLate', 'ProsperPrincipalBorrowed', 'ProsperPrincipalOutstanding', 'ScorexChangeAtTimeOfListing', 'LoanCurrentDaysDelinquent', 'LoanFirstDefaultedCycleNumber', 'LoanMonthsSinceOrigination', 'LoanNumber', 'LoanOriginalAmount', 'LoanOriginationDate', 'LoanOriginationQuarter', 'MemberKey', 'MonthlyLoanPayment', 'LP_CustomerPayments', 'LP_CustomerPrincipalPayments', 'LP_InterestandFees', 'LP_ServiceFees', 'LP_CollectionFees', 'LP_GrossPrincipalLoss', 'LP_NetPrincipalLoss', 'LP_NonPrincipalRecoverypayments', 'PercentFunded', 'Recommendations', 'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount', 'Investors']
# examine one ListingKey and see in which column the listings differ
check_dups_df = pld[pld.ListingKey == '17A93590655669644DB4C06']
# make sure that it is NOT spacing that makes a difference in non-duplicates
check_dups_df = check_dups_df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
#duplicaterows = check_dups_df[check_dups_df.duplicated(sub_cols)]
#temp_df.to_csv('dupes.csv', index=None)
check_dups_df
temp_cols = ['ListingKey', 'ListingNumber', 'ListingCreationDate', 'CreditGrade', 'Term', 'LoanStatus', 'ClosedDate', 'BorrowerAPR', 'BorrowerRate', 'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss', 'EstimatedReturn', 'ProsperRating (numeric)', 'ProsperRating (Alpha)', 'ListingCategory (numeric)', 'BorrowerState', 'Occupation', 'EmploymentStatus', 'EmploymentStatusDuration', 'IsBorrowerHomeowner', 'CurrentlyInGroup', 'GroupKey', 'DateCreditPulled', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'FirstRecordedCreditLine', 'CurrentCreditLines', 'OpenCreditLines', 'TotalCreditLinespast7years', 'OpenRevolvingAccounts', 'OpenRevolvingMonthlyPayment', 'InquiriesLast6Months', 'TotalInquiries', 'CurrentDelinquencies', 'AmountDelinquent', 'DelinquenciesLast7Years', 'PublicRecordsLast10Years', 'PublicRecordsLast12Months', 'RevolvingCreditBalance', 'BankcardUtilization', 'AvailableBankcardCredit', 'TotalTrades', 'TradesNeverDelinquent (percentage)', 'TradesOpenedLast6Months', 'DebtToIncomeRatio', 'IncomeRange', 'IncomeVerifiable', 'StatedMonthlyIncome', 'LoanKey', 'TotalProsperLoans', 'TotalProsperPaymentsBilled', 'OnTimeProsperPayments', 'ProsperPaymentsLessThanOneMonthLate', 'ProsperPaymentsOneMonthPlusLate', 'ProsperPrincipalBorrowed', 'ProsperPrincipalOutstanding', 'ScorexChangeAtTimeOfListing', 'LoanCurrentDaysDelinquent', 'LoanFirstDefaultedCycleNumber', 'LoanMonthsSinceOrigination', 'LoanNumber', 'LoanOriginalAmount', 'LoanOriginationDate', 'LoanOriginationQuarter', 'MemberKey', 'MonthlyLoanPayment', 'LP_CustomerPayments', 'LP_CustomerPrincipalPayments', 'LP_InterestandFees', 'LP_ServiceFees', 'LP_CollectionFees', 'LP_GrossPrincipalLoss', 'LP_NetPrincipalLoss', 'LP_NonPrincipalRecoverypayments', 'PercentFunded', 'Recommendations', 'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount', 'Investors']
# Exclude the column ProsperScore, then see if the duplicates are dropped
check_dups_df.drop_duplicates(subset=temp_cols, inplace=True, keep='first')
check_dups_df
The rows above were identical except in the column ProsperScore. Initially, since the scores were inconsistently assigned, I thought the rows should simply be dropped. Instead, I will take the opportunity to see whether I can correct these "errors".
pld.corr()
# there is some strange inconsistency in the way Prosper assigns ProsperScore.
# for example, the loan with ListingKey 17A93590655669644DB4C06 has exactly the same data in all the other columns
# EXCEPT ProsperScore, which takes the values 4, 8, 7, 10, 5 and 6. BIZARRE!!
# For that reason, I will exclude ProsperScore from the subset of columns passed to drop_duplicates
len(temp_cols)
#tpld = pld.drop(['ProsperScore'], axis=1)
tpld = pld.copy()
tpld.shape
# trim spaces again to make sure it is not spaces that make a difference
tpld = tpld.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
Before I remove the duplicated rows, I would like to record the range of ProsperScores that each duplicated ListingKey was assigned.
# dup_listing_idx contains the keys that have duplicates: tpld.loc[dup_listing_idx]
lk_ps_df = tpld[tpld.ListingKey.isin(dup_listing_idx)][['ListingKey', 'ProsperScore']]
lk_ps_df = lk_ps_df.sort_values(by=['ListingKey', 'ProsperScore'])
lk_ps_df.count()
# create a new column ProsperScoreRange and set index to ListingKey
lk_ps_df['ProsperScoreRange'] = None
lk_ps_df = lk_ps_df.set_index(['ListingKey'])
# create a dictionary with key as ListingKey and the ProsperScore range as value
psr_dict = {}
for lk in dup_listing_idx:
    psr_dict[lk] = lk_ps_df.loc[lk]['ProsperScore'].tolist()
list(psr_dict.items())[:10]
# place all the new range value from the dictionary into dataframe
for k, v in psr_dict.items():
    vstr = [str(i) for i in v]  # convert the scores to strings
    yy = ",".join(vstr)
    lk_ps_df.loc[k, 'ProsperScoreRange'] = yy
lk_ps_df.head(10)
# reset the index then drop the column ProsperScore from lk_ps_df
lk_ps_df.reset_index(inplace=True)
lk_ps_df = lk_ps_df.drop(['ProsperScore'], axis=1)
lk_ps_df.count()
# now remove all the duplicates on ListingKey
lk_ps_df.drop_duplicates(subset=['ListingKey'], inplace=True, keep='first')
lk_ps_df.count()
The count above matches the count of dup_listing_idx
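As an aside, the dictionary loop above could be collapsed into a single groupby/agg. A sketch on a toy frame (the keys 'A'/'B' and scores are made up stand-ins for lk_ps_df):

```python
import pandas as pd

# toy stand-in for lk_ps_df before the loop
demo = pd.DataFrame({
    'ListingKey': ['A', 'A', 'B', 'B', 'B'],
    'ProsperScore': [8.0, 4.0, 5.0, 7.0, 6.0],
})

# sort, then join each key's scores into one comma-separated string;
# this yields the same strings the psr_dict loop builds
ranges = (demo.sort_values(['ListingKey', 'ProsperScore'])
              .groupby('ListingKey')['ProsperScore']
              .agg(lambda s: ','.join(str(v) for v in s))
              .rename('ProsperScoreRange'))
```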
# quickly examine the listing keys in my working dataframe tpld
# the temp_cols list is the list of columns without ProsperScore
duplicaterows = tpld[tpld.duplicated(temp_cols)]
duplicaterows.ListingKey.value_counts()
# backup tpld
tpld_backup = tpld.copy()
# drop all the duplicates in tpld based on temp_cols subset (i.e., without ProsperScore)
tpld.drop_duplicates(subset=temp_cols, inplace=True, keep='first')
# merge the two df tpld and lk_ps_df on ListingKey
tpld = pd.merge(tpld, lk_ps_df, on=['ListingKey'], how='left')
tpld.head()
new_key_counts = tpld.ListingKey.value_counts()
new_key_counts.value_counts()
tpld.shape
tpld.ProsperScore.value_counts()
tpld['ProsperRating (Alpha)'].value_counts()
tpld['ProsperRating (numeric)'].value_counts()
tpld['CreditGrade'].value_counts()
The alpha and numeric versions of ProsperRating match exactly in count. CreditGrade uses the same letter grades as ProsperRating. Prosper's variable definitions do not state whether ProsperRating and CreditGrade share the same grading system; I hope the analysis below will confirm that they do.
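A quick way to check whether the two columns share a letter vocabulary is a set comparison of their unique labels. A sketch with hand-typed labels (on the real data these would come from tpld['CreditGrade'] and tpld['ProsperRating (Alpha)']):

```python
import pandas as pd

# hand-typed label sets standing in for the two real columns
credit_grade = pd.Series(['AA', 'A', 'B', 'C', 'D', 'E', 'HR', 'NC'])
prosper_alpha = pd.Series(['AA', 'A', 'B', 'C', 'D', 'E', 'HR'])

cg, pa = set(credit_grade.dropna()), set(prosper_alpha.dropna())
shared = sorted(cg & pa)   # letter grades both columns use
only_cg = sorted(cg - pa)  # labels only CreditGrade uses (e.g. 'NC')
```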
pld.shape[0] - tpld.shape[0]
#select_features = ['EmploymentStatus', 'LoanOriginalAmount', 'CreditScoreRangeUpper', 'LoanStatus', 'BorrowerRate', 'StatedMonthlyIncome', 'ProsperRating (numeric)', 'ListingCategory (numeric)', 'IsBorrowerHomeowner']
select_features = ['ListingKey', 'EmploymentStatusDuration', 'EmploymentStatus', 'LoanOriginalAmount', 'CreditGrade', 'ProsperRating (numeric)', 'ProsperScore', 'ProsperScoreRange', 'BorrowerRate', 'CreditScoreRangeUpper', 'CurrentDelinquencies', 'IsBorrowerHomeowner', 'CurrentCreditLines', 'AvailableBankcardCredit']
df = tpld[select_features]
df.info()
df.shape
# There are a lot of nulls. What to do?
# drop listingnumber, borrowerapr (seems similar to borrowerrate), Investmentfromfriendscount, investmentfromfriendsamount,
# percentfunded (nearly 100%)
df.corr()
df.describe()
df.isnull().sum()
print(list(df.columns))
The original dataset had shape (113937, 81). After discovering that Prosper was inconsistent in assigning ProsperScore, I added a new column called ProsperScoreRange. I then dropped all the duplicated rows, keeping the first of each. The new shape is (113066, 82). With the selected features, the final shape is (113066, 14).
The ProsperRating and BorrowerRate features are interesting. I would like to see whether they are predictable from other features. BorrowerRate is highly correlated with ProsperRating, so I believe BorrowerRate is derived from it. I will keep track of these 2 features against the features below.
Initially, I was interested in predicting ProsperRating. It is, however, highly correlated with BorrowerRate, and some samples lack ProsperRating while all of them have BorrowerRate. Framed as an automation or machine learning problem, predicting ProsperRating would be classification, while predicting BorrowerRate would be regression. Since that is not the purpose of this project, I will leave the automation problem for a future task.
I will attempt to correct the ProsperScore on those indices that had duplicates in exploration.
Since ProsperRating and CreditGrade are mutually exclusive, I will also combine them into a new column. Both populations have BorrowerRate, so I will examine how each correlates with BorrowerRate.
I will examine in the bivariate section how CreditGrade's relationship to BorrowerRate compares with that of ProsperRating.
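Combining two mutually exclusive columns can be sketched with combine_first, which takes the first column's value where present and falls back to the second. The toy values below are made up (the notebook later does the equivalent with Series.update):

```python
import numpy as np
import pandas as pd

# toy frame: ProsperRating populated only post-2009, CreditGrade only pre-2009
toy = pd.DataFrame({
    'ProsperRating': [np.nan, np.nan, 5.0, 7.0],
    'CreditGrade':   [3.0,    6.0,    np.nan, np.nan],
})

# take ProsperRating where present, else fall back to CreditGrade
toy['ProsperRatingGrade'] = toy['ProsperRating'].combine_first(toy['CreditGrade'])
```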
I will investigate the following features.
'ListingKey', 'EmploymentStatusDuration', 'EmploymentStatus', 'LoanOriginalAmount', 'CreditGrade', 'ProsperRating (numeric)', 'ProsperScore', 'ProsperScoreRange', 'BorrowerRate', 'CreditScoreRangeUpper', 'CurrentDelinquencies', 'IsBorrowerHomeowner', 'CurrentCreditLines', 'AvailableBankcardCredit'
I will keep track of these 2 features against the features above: 'BorrowerRate', 'ProsperRating (numeric)'.
I retained ListingKey so I can continue processing below.
df = df.rename(columns={'ProsperRating (numeric)': 'ProsperRating'})
In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.
# set a base color palette
pleasant = sb.color_palette()[0]
#fig, ax = plt.subplots(figsize=(10,8))
#df.EmploymentStatus.plot(kind='bar', ax=ax)
#plt.show();
plt.figure(figsize=(10,8))
sb.countplot(data = df, x='EmploymentStatus', color=pleasant)
plt.xticks(rotation=45, horizontalalignment='right')
plt.title('Employment Status Count');
Although I will compare the relationship between EmploymentStatus and EmploymentStatusDuration later in the bivariate analysis, I will first look at them one by one here.
Nearly all of the borrowers are employed. Very few borrowers have "Not employed" status. It would be interesting to see what other factors Prosper used to extend loans to the unemployed borrowers.
plt.figure(figsize=(10,8))
sb.histplot(data=df, x='EmploymentStatusDuration', color=pleasant, bins=100)
plt.xlabel('Employment Duration in months')
plt.title('Employment Duration in months')
It is extremely right-skewed. Most people have spent fewer than 100 months in their current employment status.
plt.figure(figsize=(10,8))
sb.histplot(data=df, x='EmploymentStatusDuration', color=pleasant, bins=100)
plt.xscale('log')
plt.xlabel('Employment Duration in months')
plt.title('Employment Duration in months');
I am not sure this log transformation clarified the right-skewness of the distribution.
sb.countplot(data=df, x='ProsperRating', color=pleasant)
plt.title('ProsperRating Count');
ProsperRating looks roughly normally distributed.
sb.boxplot(data=df, x='ProsperRating')
plt.title('Another view of the obvious');
The box plot is consistent with a roughly symmetric, normal-like distribution.
sb.countplot(data=df, x='ProsperScore', color=pleasant)
plt.title('ProsperScore Count')
Both ProsperScore and ProsperRating seem to be ordinal. It is not clear to me whether a bigger score is better or worse; the Prosper Loan Dictionary does not specify, so I will examine this further. Both appear roughly normally distributed.
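One way to settle the direction question is to check the sign of the correlation with BorrowerRate: if a bigger score meant more risk, rates should rise with the score. A sketch on made-up numbers (the real check would be df[['ProsperScore', 'BorrowerRate']].corr()):

```python
import pandas as pd

# made-up score/rate pairs purely to illustrate the sign check
toy = pd.DataFrame({
    'ProsperScore': [2, 4, 6, 8, 10],
    'BorrowerRate': [0.32, 0.28, 0.20, 0.15, 0.08],
})

r = toy['ProsperScore'].corr(toy['BorrowerRate'])
# a negative sign means bigger scores get cheaper loans, i.e. bigger is better
bigger_is_better = r < 0
```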
plt.figure(figsize=(10,8))
sb.histplot(data=df, x='BorrowerRate', color=pleasant)
plt.title('Distribution of BorrowerRate');
This distribution is all over the place. There are wild peaks around .15, .25, and .35, and an especially huge one around .32.
plt.figure(figsize=(10,8))
# distplot is deprecated; histplot with kde=True gives the same view
sb.histplot(data=df, x='BorrowerRate', stat='density', kde=True, color=pleasant)
plt.title('Distribution of Borrower Rate with kde');
There are a lot of humps here; the distribution is multimodal. Most loans have a borrower rate between .1 and .2. However, there are also many between .2 and .3, and a really huge hump just above .3.
fig, ax = plt.subplots(figsize=(12,10))
df.hist(['LoanOriginalAmount', 'CreditScoreRangeUpper', 'CurrentDelinquencies', 'CurrentCreditLines', 'AvailableBankcardCredit'], ax=ax);
This is a quick series of distribution plots:
- LoanOriginalAmount: mostly below 10k, with bumps at increments of 5000
- CreditScoreRangeUpper: ranges from 450 to 900
- CurrentDelinquencies: mostly below 10
- CurrentCreditLines: mostly between 5 and 15
- AvailableBankcardCredit: mostly below 10k
plt.figure(figsize=(10,8))
sb.countplot(data=df, x='CurrentCreditLines', color=pleasant)
plt.xticks(rotation=90)
plt.title('Number of Current Credit Lines');
Below 20, CurrentCreditLines looks to have a roughly normal distribution.
plt.figure(figsize=(10,8))
sb.countplot(data=df, x='CurrentCreditLines', color=pleasant)
plt.xlim(0,20)
plt.xticks(rotation=90)
plt.title('Number of Current Credit Lines');
It is slightly right-skewed with the mean just slightly higher than the median. There are some big outliers. It looks almost normal when I limited the x-axis.
print(f"median: {df.CurrentCreditLines.median()}, mean: {df.CurrentCreditLines.mean()}")
plt.figure(figsize=(10,8))
sb.histplot(data=df, x='LoanOriginalAmount', color=pleasant, bins=100)
plt.title('Distribution of Loan Amounts');
As I suspected above, the loan amounts had peaks around increments of 5000.
plt.figure(figsize=(10,8))
# distplot is deprecated; histplot with kde=True gives the same view
sb.histplot(data=df, x='LoanOriginalAmount', stat='density', kde=True, color=pleasant)
plt.title('Distribution of Loan Amount with kde');
I do not know what to make of this yet. The bumps look like Prosper preferred to lend in increments of 5000: a big bump at 5000, then bumps at 10000, 15000, 20000, and 25000 (perhaps even at 30000 and 35000).
sb.countplot(data=df, x='IsBorrowerHomeowner', color=pleasant);
There are about as many homeowners as non-homeowners; the split is nearly uniform.
df.IsBorrowerHomeowner.value_counts()
plt.figure(figsize=(10,8))
sb.histplot(data=df, x='CreditScoreRangeUpper', color=pleasant, bins=100)
plt.xlim(625,800)
plt.title('Distribution of Credit Scores');
The credit score distribution looks almost normal with the mean and the median at around 700.
I will assume that CreditGrade has the same legend as ProsperRating which is
The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009.
I will create a numeric column based on CreditGrade
df.CreditGrade.value_counts()
pr_dict = {'NC': 0, 'HR': 1, 'E': 2, 'D': 3, 'C': 4, 'B': 5, 'A': 6, 'AA': 7}
# avoid an inplace replace on a slice (SettingWithCopyWarning)
df['CreditGrade'] = df['CreditGrade'].replace(pr_dict)
plt.figure(figsize=(10,8))
sb.countplot(data = df, x='CreditGrade', color=pleasant)
#plt.xticks(rotation=45, horizontalalignment='right')
plt.title('Credit Grade Count');
If we discount the 0 rating, CreditGrade looks almost normal.
df[['CreditGrade', 'ProsperRating']].describe()
If you discount the 0 rating, it looks normal, and the distribution is comparable to ProsperRating. This supports the conclusion that the alphabetic rating used in ProsperRating is the same as the one used in CreditGrade.
My variable of interest is BorrowerRate, since all the records have it. BorrowerRate has a lot of peaks. ProsperRating has a roughly normal distribution; ProsperScore is less normal than ProsperRating. Nevertheless, I will do some more cleaning below after some bivariate analysis. I attempted a transformation on employment duration; it confirmed that most borrowers have worked at least 2 years.
As I mentioned, there were some ListingKeys with duplicates that differed only in their ProsperScore. I believe these were clerical errors. I kept one of each duplicated listing, and I created a new column called ProsperScoreRange so that I can examine the "incorrect" ProsperScores after I impute values into ProsperScore based on ProsperRating.
I converted the CreditGrade column to the numeric scale described for ProsperRating in Prosper's variable definitions. If you discount the 0 rating, CreditGrade looks normal, and its distribution is comparable to ProsperRating. This supports the conclusion that the two columns use the same alphabetic rating scale.
In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).
# Now I will create a new column combining ProsperRating and CreditGrade since they are mutually exclusive
# But first I would like to confirm that CreditGrade's relationship to BorrowerRate is similar to that of ProsperRating.
sb.catplot(data=df, x='CreditGrade', y='BorrowerRate', color=pleasant, alpha=0.1);
BorrowerRate trends downward as CreditGrade increases.
sb.catplot(data=df, x='ProsperRating', y='BorrowerRate', color=pleasant, alpha=0.1);
Similar to above, BorrowerRate trends downward as ProsperRating increases.
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True)
sb.regplot(x='ProsperRating', y='BorrowerRate', ax=ax1, data=df)
sb.regplot(x='CreditGrade', y='BorrowerRate', ax=ax2, data=df)
Side by side, it is clearer that they are similarly correlated with BorrowerRate, which warrants combining them into one column. As a reminder, they are mutually exclusive: CreditGrade applies before mid-2009, ProsperRating after mid-2009.
sb.heatmap(df.corr(), fmt='.2f', annot=True, cmap='RdBu_r')
ProsperRating and ProsperScore are positively correlated.
ProsperScore and ProsperRating are both strongly negatively correlated with BorrowerRate. Since all loans have a BorrowerRate, I believe these 3 features are co-determined from all the other features.
The heatmap confirms that ProsperRating and CreditGrade are in fact mutually exclusive, and their plots against BorrowerRate are comparable. I will now combine ProsperRating and CreditGrade into one column.
df[['CreditGrade', 'ProsperRating']].notnull().sum()
df['ProsperRatingGrade'] = df['ProsperRating']
df.ProsperRatingGrade.update(df['CreditGrade'])
df.ProsperRatingGrade.isnull().sum()
All 113066 records accounted for.
df[['CreditGrade', 'ProsperRating', 'ProsperRatingGrade']].sample(1000).head(20)
I successfully combined CreditGrade and ProsperRating. I will now do more bivariate analysis.
sb.pairplot(df[['EmploymentStatusDuration', 'EmploymentStatus', 'LoanOriginalAmount', 'ProsperRatingGrade', 'BorrowerRate', 'CreditScoreRangeUpper', 'CurrentDelinquencies', 'IsBorrowerHomeowner', 'CurrentCreditLines', 'AvailableBankcardCredit']])
This is not so readable. But this quickly gives me hints on how to plot the relationships better.
sb.catplot(data=df, x='ProsperRatingGrade', y='BorrowerRate', color=pleasant, alpha=0.1);
ProsperRatingGrade is the column that combines CreditGrade and ProsperRating. As expected, its density and trend against BorrowerRate match those of each of the original features.
sb.lmplot(data=df.sample(11000), x='ProsperRatingGrade', y='BorrowerRate', x_jitter=.05)
This shows that the higher the ProsperRatingGrade the lower BorrowerRate.
sb.catplot(data=df, x='ProsperRating', y='ProsperScore', color=pleasant, alpha=0.009)
Is it justifiable to use ProsperRating to correct ProsperScore? The density hints at a positive correlation.
sb.lmplot(data=df.sample(11000), x='ProsperRating', y='ProsperScore')
The line shows a correspondence between ProsperRating and ProsperScore.
# what is the average score in each rating
# get a temporary df without the dup IDs
temp_df = df[~df.ListingKey.isin(dup_listing_idx)]
temp_df.shape
fig, ax = plt.subplots(figsize=(10,8))
#plt.figure(figsize=(10,8))
sb.lineplot(data=temp_df, x='ProsperRating', y='ProsperScore', marker='o')
ax.set_yticks(np.arange(11))
plt.title('ProsperScore against ProsperRating');
The graph did not really change much. But I will use the opportunity to "correct" ProsperScore based on ProsperRating to get to know the data better
grptemp = temp_df.groupby('ProsperRating')
grptemp.mean()
gm = grptemp.mean()
print(gm.ProsperScore.apply(np.ceil))
grptemp.median()
I will use the mean scores obtained above to "correct" the ProsperScore in those duplicated indices with varying ProsperScore. Although I do not believe that this corrects the ProsperScore, I will use it for now. I will try to confirm by other means if my simple "correction" based on ProsperRating approximates truth. I did save the original ProsperScores in a new column called ProsperScoreRange.
# work with grouped mean dataframe and reset index
# then create a dictionary with ProsperRating as key and ProsperScore as value from mean() dataframe
gm = gm.reset_index() # necessary to use ProsperRating again as a column
p_score = dict(zip(gm.ProsperRating, gm.ProsperScore.apply(np.ceil)))
p_score
#df = work_df.copy()
work_df = df.copy() # just another backup of working dataframe
# with the working dataframe df, for each ListingKey in dup_listing_idx,
# change the ProsperScore to the corresponding value in the p_score dictionary
# I tried the vectorized way below, but I experimented too much; I will use a loop instead.
#work_df.loc[work_df.ListingKey.isin(dup_listing_idx), 'ProsperScore'] = work_df.ProsperScore.map(p_score)
for idx in dup_listing_idx:
    # get the ProsperRating to use for the dict lookup
    pr = df.loc[df.ListingKey == idx, 'ProsperRating'].iloc[0]
    df.loc[df.ListingKey == idx, 'ProsperScore'] = p_score.get(pr)
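For reference, the vectorized form the commented-out line was aiming at maps from ProsperRating (not ProsperScore) and restricts the assignment to the flagged keys. A sketch on made-up stand-ins for df, p_score and dup_listing_idx:

```python
import pandas as pd

# made-up rating -> ceil(mean score) lookup and toy data
p_score = {1.0: 3.0, 4.0: 6.0, 7.0: 9.0}
toy = pd.DataFrame({
    'ListingKey': ['A', 'B', 'C'],
    'ProsperRating': [1.0, 4.0, 7.0],
    'ProsperScore': [8.0, 2.0, 5.0],
})
dup_keys = {'A', 'C'}  # keys whose duplicates disagreed on ProsperScore

# look the replacement score up from ProsperRating, only on flagged rows
mask = toy.ListingKey.isin(dup_keys)
toy.loc[mask, 'ProsperScore'] = toy.loc[mask, 'ProsperRating'].map(p_score)
```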
#peek_df = work_df[work_df.ListingKey.eq('09233589620788733CFB8CE')]
peek_df = df[df.ListingKey.isin(dup_listing_idx)]
peek_df[['CreditGrade', 'ProsperRating', 'ProsperScore', 'ProsperRatingGrade', 'ProsperScoreRange']].head(20)
plt.figure(figsize=(10,8))
sb.lineplot(data=df, x='ProsperRating', y='ProsperScore')
plt.title('Correlation between ProsperRating and ProsperScore (with ProsperScore corrections)');
Again, this shows the same lineplot but with the ProsperScore corrections. The visuals should not have changed, since I used ProsperRating to correct ProsperScore.
sb.lmplot(data=df.sample(10000), x='ProsperRating', y='BorrowerRate', x_jitter=.05)
#sb.pairplot(df, kind='reg', plot_kws={'line_kws':{'color':'red'}, 'scatter_kws': {'alpha': 0.1}})
I have seen this plot already but as regplot. However, this lmplot is computationally expensive even with a fraction of the data.
t_cols = list(df.columns)
print(t_cols)
t_cols = ['BorrowerRate', 'LoanOriginalAmount', 'CreditScoreRangeUpper', 'ProsperRatingGrade']
sb.pairplot(df[t_cols].sample(10000), kind='reg', plot_kws={'line_kws':{'color':'red'}, 'scatter_kws': {'alpha': 0.1}})
ProsperRatingGrade is positively correlated with CreditScoreRangeUpper. LoanOriginalAmount is positively correlated with CreditScoreRangeUpper.
sb.catplot(data=df, x='ProsperRatingGrade', y='BorrowerRate', color=pleasant, alpha=0.1)
As I mentioned before it didn't make a difference in the big picture.
I drew the graph first with the line and then as a scatterplot indicating point density. It is clear to me now that ProsperRating is ordinal: the higher the ProsperRating, the lower the BorrowerRate.
temp_df = df[['BorrowerRate', 'ProsperRatingGrade']]
temp_df.isnull().sum()
temp_df = temp_df.dropna()
temp_df.isnull().sum()
temp_df.shape
sb.lmplot(data=temp_df.sample(10000), x='ProsperRatingGrade', y='BorrowerRate', x_jitter=.05, line_kws={'color':'red'})
plt.title('Borrower Rate against Prosper Rating');
I used a red line to show the trend between ProsperRatingGrade and BorrowerRate more clearly.
sb.catplot(data=df, x='EmploymentStatus', y='EmploymentStatusDuration', color=pleasant, alpha=0.1)
plt.xticks(rotation=45, horizontalalignment='right')
plt.title('Employment Duration vs Status');
I thought there might be a relationship between employment status and duration, but I don't see much.
sb.catplot(data=df, x='EmploymentStatus', y='BorrowerRate', color=pleasant, alpha=0.1)
plt.xticks(rotation=45, horizontalalignment='right');
With the points denser in the first 5 statuses, Prosper had a preference to lend to the employed. However, the rates varied just as widely regardless of employment status.
plt.figure(figsize=(10,8))
sb.boxplot(data=df, x='EmploymentStatus', y='BorrowerRate', color=pleasant)
plt.xticks(rotation=45, horizontalalignment='right')
plt.title('Borrower rates among employment statuses');
As mentioned above, the rates did vary regardless of employment status. But what is clear here is that the employed enjoyed lower rates on average.
df.EmploymentStatus.value_counts()
There is a definite range of rates given to the first 5 statuses.
sb.heatmap(df.corr(), fmt='.2f', annot=True, cmap='RdBu_r')
The mutual exclusivity of CreditGrade and ProsperRating is quite evident in the heatmap; their correlation hints at independence.
plt.figure(figsize=(10,8))
sb.lmplot(data=df.sample(frac=.25), x='LoanOriginalAmount', y='BorrowerRate', line_kws={'color': 'red'});
BorrowerRate trends downward as LoanOriginalAmount increases.
plt.figure(figsize=(10,8))
sb.lmplot(data=df.sample(frac=.25), x='CreditScoreRangeUpper', y='BorrowerRate', line_kws={'color': 'red'});
Higher CreditScoreRangeUpper values correspond to lower BorrowerRate.
df.CreditScoreRangeUpper.isnull().sum()
df.CreditScoreRangeUpper.describe()
# computationally expensive using sample
sb.lmplot(x="BorrowerRate", y="IsBorrowerHomeowner", data=df.sample(10000), logistic=True, y_jitter=.03);
It looks as if they disbursed funds regardless of homeownership. However, being a homeowner helps: the BorrowerRate is lower for homeowners.
plt.figure(figsize=(10,8))
sb.lmplot(data=df.sample(frac=.25), x='CurrentCreditLines', y='BorrowerRate', line_kws={'color': 'red'});
The relationship between CurrentCreditLines and BorrowerRate appears weak.
# ProsperRatingGrade vs homeownership
plt.figure(figsize =(10,8))
sb.countplot(data=df, x='ProsperRatingGrade', hue='IsBorrowerHomeowner');
Homeownership seemed to matter less as the ProsperRatingGrade got higher. Prosper Rating seemed to matter more
plt.figure(figsize=(10,8))
sb.boxplot(data=df, x='CreditGrade', y='BorrowerRate', color=pleasant)
plt.xticks(rotation=45, horizontalalignment='right')
plt.title('BorrowerRate against CreditGrade');
Here is another look at CreditGrade vs BorrowerRate. The mean and the quartiles of BorrowerRate trend down as rating increases.
Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.
g = sb.FacetGrid(data=df, height=6, col='ProsperRatingGrade', col_wrap=2)
g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1});
g.add_legend()
g.fig.suptitle('Relationship between Loan Amount and BorrowerRate broken down by Prosper Rating Grade');
It looks as though the higher Prosper Rating Grades have the lowest, and narrowest, range of borrower rates.
g = sb.FacetGrid(data=df, height=6, col='ProsperRatingGrade', col_wrap=4)
g.map(sb.regplot, 'CreditScoreRangeUpper', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1}, line_kws={'color':'red'});
g.add_legend()
g.fig.suptitle('Relationship between Credit Score and BorrowerRate broken down by Prosper Rating Grade');
Credit score made more of a difference for borrowers with a Prosper Rating grade of 4 or below.
g = sb.FacetGrid(data=df, height=6, col='ProsperRatingGrade', col_wrap=4)
g.map(sb.regplot, 'AvailableBankcardCredit', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1});
g.add_legend()
g.fig.suptitle('Relationship between Available Bank Credit and BorrowerRate broken down by Prosper Rating Grade');
Interestingly, for people with a Prosper Rating of 5 or higher, BorrowerRate and available bankcard credit were negatively correlated: the higher the available credit, the lower the rate. For those with a rating of 4 or below, the relationship was positive: the rate rose as available credit increased. Again, this might reflect other features that are driving the rating down.
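This sign flip across rating groups is the classic case where a pooled correlation hides opposite within-group trends. A toy sketch (synthetic numbers, not the Prosper data) of checking the correlation sign per group:

```python
import pandas as pd

# synthetic data: in the 'high' group, credit up -> rate down;
# in the 'low' group, credit up -> rate up
toy = pd.DataFrame({
    'grade':  ['high'] * 4 + ['low'] * 4,
    'credit': [1000, 2000, 3000, 4000, 1000, 2000, 3000, 4000],
    'rate':   [0.20, 0.18, 0.16, 0.14, 0.25, 0.27, 0.29, 0.31],
})

# Pearson correlation of rate vs credit within each grade
per_group = {g: sub['credit'].corr(sub['rate'])
             for g, sub in toy.groupby('grade')}
print(per_group)
```

Applied to `df` grouped by `ProsperRatingGrade`, this would confirm the opposite signs observed in the faceted regression plots.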
# see the relationship of credit scores broken down # currentCreditLines CurrentDelinquencies
g = sb.FacetGrid(data=df, height=6, col='ProsperRatingGrade', col_wrap=4)
g.map(sb.regplot, 'CurrentDelinquencies', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1}, line_kws={'color':'red'});
g.add_legend()
g.fig.suptitle('Relationship between Current Delinquencies and BorrowerRate broken down by Prosper Rating Grade');
CurrentDelinquencies seems to have an effect on BorrowerRate: generally, the more delinquencies, the higher the BorrowerRate. Interestingly, at the lowest grade the BorrowerRate seems to go lower; I think this is just the effect of already having a low ProsperRating.
g = sb.FacetGrid(data=df, height=6, col='ProsperRatingGrade', col_wrap=4)
g.map(sb.regplot, 'CurrentCreditLines', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1}, line_kws={'color':'red'});
g.add_legend()
g.fig.suptitle('Relationship between number of credit lines and BorrowerRate broken down by Prosper Rating Grade');
Broken down by Prosper rating, the trend between BorrowerRate and the number of credit lines is generally flat. However, at a rating of 1 the relationship is positive, i.e., at that low rating, the more credit lines, the higher the rate. Interesting!
g = sb.FacetGrid(data=df, height=6, col='CreditGrade', col_wrap=4)
g.map(sb.regplot, 'CreditScoreRangeUpper', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1}, line_kws={'color':'red'});
g.add_legend()
g.fig.suptitle('Relationship between pre-2009 credit grade and BorrowerRate broken down by Grade');
This graph and the graph below are probably the most revealing of the whole investigation. This is BorrowerRate against credit scores broken down by each CreditGrade, and it is where I discovered that CreditGrade is tied very closely to CreditScoreRangeUpper. Note the range of credit scores per grade.
g = sb.FacetGrid(data=df, height=6, col='ProsperRating', col_wrap=4)
g.map(sb.regplot, 'CreditScoreRangeUpper', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1}, line_kws={'color':'red'});
g.add_legend()
g.fig.suptitle('Relationship between post-2009 Prosper Rating and BorrowerRate broken down by Grade');
This is extremely interesting. The pre-2009 CreditGrade is very closely tied to CreditScoreRangeUpper, whereas the post-2009 ProsperRating is much less dependent on credit score.
This calls into question whether CreditGrade and ProsperRating should be combined into one column.
However, looking at the range of BorrowerRate per grade, the distributions are comparable:
| Grade | CreditGrade | ProsperRating |
|---|---|---|
| 0 | 0 | 0 |
| 1 | .24 - .26 | .32 |
| 2 | .22 - .32 | .28 |
| 3 | .21 | .24 - .26 |
| 4 | .16 and .18 | .18 - .22 |
| 5 | .14 | .16 |
| 6 | .1 - .15 | .12 |
| 7 | .07 - .1 | .07 |
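The per-grade rate ranges in the table above can be extracted programmatically with a groupby aggregation. A sketch with made-up (grade, rate) pairs standing in for the real columns:

```python
import pandas as pd

# hypothetical (grade, rate) observations
rates = pd.DataFrame({
    'grade': [1, 1, 2, 2, 2, 3, 3],
    'rate':  [0.24, 0.26, 0.22, 0.30, 0.32, 0.21, 0.21],
})

# min/max BorrowerRate observed per grade, as in the table above
ranges = rates.groupby('grade')['rate'].agg(['min', 'max'])
print(ranges)
```

On the real data, `df.groupby(...)['BorrowerRate'].agg(['min', 'max'])` for each rating column would reproduce the table without reading values off the plots.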
# side-by-side boxplots of BorrowerRate on the pre-2009 and post-2009 scales
fig, ax = plt.subplots(1, 2, figsize=(12, 10))
sb.boxplot(x='CreditGrade', y='BorrowerRate', data=df, ax=ax[0])
sb.boxplot(x='ProsperRating', y='BorrowerRate', data=df, ax=ax[1]);
# build a post2009 flag: mark rows with a post-2009 ProsperRating (1)
# and rows with a pre-2009 CreditGrade (2), then merge the two markers
wdf = df.copy()
wdf['IsPR'] = np.where(wdf.ProsperRating.notnull(), 1, None)  # 1 where post-2009
wdf['IsCG'] = np.where(wdf.CreditGrade.notnull(), 2, None)    # 2 where pre-2009
wdf['post2009'] = wdf['IsPR']
wdf.post2009.update(wdf['IsCG'])  # fill the remaining rows with the pre-2009 marker
wdf.post2009.isnull().sum()
wdf[['IsPR', 'ProsperRating', 'IsCG', 'CreditGrade', 'post2009']].sample(1300).head(20)
wdf = wdf.dropna(subset=['post2009'])
wdf.isnull().sum()
wdf.post2009.replace(2, 0, inplace=True)  # recode the pre-2009 marker from 2 to 0
wdf[['IsPR', 'ProsperRating', 'IsCG', 'CreditGrade', 'post2009']].sample(1300).head(20)
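The IsPR/IsCG/update steps above can be collapsed into a single `np.select` call with the same assumed semantics: 1 for rows with a ProsperRating, 0 for rows with only a CreditGrade, NaN otherwise. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# toy frame: two post-2009 rows, one pre-2009 row, one with neither
toy = pd.DataFrame({
    'ProsperRating': [4.0, 6.0, np.nan, np.nan],
    'CreditGrade':   [np.nan, np.nan, 'C', np.nan],
})

# 1 = post-2009 (ProsperRating present), 0 = pre-2009 (CreditGrade present)
toy['post2009'] = np.select(
    [toy.ProsperRating.notnull(), toy.CreditGrade.notnull()],
    [1, 0],
    default=np.nan,
)
print(toy)
```

`np.select` evaluates the conditions in order, so a row with both columns present would be labeled post-2009; in this dataset the two rating columns are mutually exclusive, so the result matches the update-based construction.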
plt.figure(figsize=(10,8))
sb.boxplot(x='ProsperRatingGrade', y='BorrowerRate', hue='post2009', data=wdf)
plt.title('BorrowerRate by combined grade: pre-2009 (0) vs post-2009 (1)');
BorrowerRate against both CreditGrade (pre-2009) and ProsperRating (post-2009): in both eras, BorrowerRate trends downward as the rating increases.
plt.figure(figsize=(10,8))
sb.boxplot(x='ProsperRatingGrade', y='CreditScoreRangeUpper', hue='post2009', data=wdf)
plt.title('CreditScoreRangeUpper by combined grade: pre-2009 (0) vs post-2009 (1)');
It is clear that no single factor determines the BorrowerRate or the ProsperRatingGrade.